Scaling Spark PG DB Beyond 500 GiB

Ideas:

Constraints:

  • If we run our DB with a cloud provider other than Fly.io, we have to pay for bandwidth between the Spark services running on Fly.io and the DB server.
  • We could migrate the Spark services too, but that’s even more effort.

Migrate our DB to a hosted platform

  • AWS RDS Aurora
    • Estimated cost at current usage: $316/month
  • CockroachDB is too expensive

Use web3.storage for historical measurements

👉🏻 Pull request: https://github.com/filecoin-station/spark-api/pull/182

Roll up published_as CIDs into a new table and remove the published measurements from the measurements table (keep only measurements from the last X hours).
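
A minimal sketch of what the roll-up table could look like, using node-postgres; the table name (commitments) and the column types are assumptions, not a final schema:

```ts
// Sketch of a migration creating the roll-up table. One row per published
// commitment; adjust names/types to the real schema.
import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

await client.query(`
  CREATE TABLE IF NOT EXISTS commitments (
    contract_address TEXT NOT NULL,
    round_index BIGINT NOT NULL,
    cid TEXT NOT NULL,
    PRIMARY KEY (contract_address, round_index, cid)
  )
`)

await client.end()
```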

Downsides:

  • Before we can run any more analysis on the SQL data, we need to build a tool to fetch measurements from web3.storage (see the sketch after this list).
  • The data migration is going to be very CPU-intensive and will take a long time to complete. We need to run it manually, outside of the usual deployment/auto-migration process.
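
A rough sketch of such a fetch tool, assuming the published batches are plain JSON retrievable through web3.storage’s public IPFS gateway (w3s.link); the actual encoding may differ:

```ts
// Download a published measurements batch by its CID via the public
// web3.storage gateway. Assumes the batch is stored as a JSON array.
export async function fetchMeasurements (cid: string): Promise<unknown[]> {
  const res = await fetch(`https://w3s.link/ipfs/${cid}`)
  if (!res.ok) {
    throw new Error(`Cannot fetch measurements ${cid}: ${res.status} ${res.statusText}`)
  }
  return await res.json()
}

// Example usage: const measurements = await fetchMeasurements('bafy…')
```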

Algorithm:

  • In spark-publish, after we publish the CID to the smart contract, insert a new row (contract_address, round_index, cid).
  • Periodically, delete all measurements that have the published_as field set and are older than X hours (see the sketch after this list).
  • Initial migration:

    For historical measurements where we don’t know the exact round index they were published in (the round index can differ from what spark-api filled in the measurements table), assign their published_as CID to a “virtual” contract and round index, e.g. contract_address=0xUNKNOWN and round_index=0.
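
A sketch of the first two steps (referenced from the list above), assuming a roll-up table like the one sketched earlier and a measurements table with a finished_at timestamp column; the column name is an assumption:

```ts
import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// Step 1: in spark-publish, record the commitment right after the CID has
// been published to the smart contract.
export async function recordCommitment (
  contractAddress: string,
  roundIndex: number,
  cid: string
): Promise<void> {
  await client.query(
    'INSERT INTO commitments (contract_address, round_index, cid) VALUES ($1, $2, $3)',
    [contractAddress, roundIndex, cid]
  )
}

// Step 2: run periodically, e.g. from a scheduled task. Deletes measurements
// that were already published and fall outside the retention window.
// "finished_at" stands in for whichever timestamp column measurements have.
export async function deletePublishedMeasurements (retentionHours: number): Promise<void> {
  await client.query(
    `DELETE FROM measurements
     WHERE published_as IS NOT NULL
       AND finished_at < now() - $1::interval`,
    [`${retentionHours} hours`]
  )
}
```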

Improvement:

  • To make it easier to filter CIDs based on date ranges, we can add a timestamp field to each (contract_address, round_index, cid) row.
  • For historical measurements, we can set this value to the latest measurement timestamp + INTERVAL '3 minutes', aggregated across all measurements published in the same batch (i.e. sharing the same published_as CID); a backfill sketch follows this list.
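
A hedged sketch of that backfill; the published_at column on the roll-up table and the finished_at column on measurements are assumptions, not the real schema:

```ts
import pg from 'pg'

const client = new pg.Client({ connectionString: process.env.DATABASE_URL })
await client.connect()

// Add a timestamp to each (contract_address, round_index, cid) row.
await client.query(
  'ALTER TABLE commitments ADD COLUMN IF NOT EXISTS published_at TIMESTAMPTZ'
)

// Backfill historical rows: newest measurement timestamp in each published
// batch, padded by the ~3 minute publishing cadence.
await client.query(`
  UPDATE commitments c
  SET published_at = m.latest + INTERVAL '3 minutes'
  FROM (
    SELECT published_as, MAX(finished_at) AS latest
    FROM measurements
    GROUP BY published_as
  ) m
  WHERE c.cid = m.published_as
`)

await client.end()
```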

Expected savings:

  • If each commitment contains ~50k measurements of ~600 bytes each, we will add ~100 bytes per commitment and delete ~30 MB. We are publishing a new commitment every ~3 minutes, i.e. ~480 commitments per day, which works out to ~14.4 GB deleted per day.
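
The arithmetic behind that estimate, spelled out:

```ts
// Back-of-the-envelope numbers from the estimate above.
const measurementsPerCommitment = 50_000
const bytesPerMeasurement = 600
const commitmentsPerDay = (24 * 60) / 3 // one commitment every ~3 minutes = 480

const deletedPerCommitment = measurementsPerCommitment * bytesPerMeasurement // 30 MB
const deletedPerDay = deletedPerCommitment * commitmentsPerDay // ~14.4 GB
const addedPerDay = 100 * commitmentsPerDay // ~48 kB of new roll-up rows

console.log({ deletedPerCommitment, deletedPerDay, addedPerDay })
```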

Compress data

https://medium.com/@lk.snatch/postgresql-compression-854a4647ee43

  • Postgres Pro
  • Compressed file system (e.g. ZFS with transparent compression)